This guide is written not for search users, but for search service administrators who create and configure a search service. Although we try to use explicit terminology as much as possible, sometimes the word "user" will be used instead of "search service administrator" when the meaning is clear from the context.
locust consists of a multi-threaded spidering program (spider) that downloads documents from the internet, a merging tool that converts and merges spider journals into fast searchable reverse indexes (deltamerge), a tool to manage and query the database (stortool), a search daemon (searchd), a CGI search frontend producing template-based HTML output or an XML document, and auxiliary tools. The following is a general description of the locust parts and their interaction.
The spider downloads documents from the internet, parses them for keywords and links, then stores document attributes and compressed documents themselves in MySQL tables and writes the spider journal. The spider journal contains records relating documents with keywords, links and document modification times. To increase performance, the spider uses threads to run multiple download workers simultaneously.
The merging tool processes the spider journal and creates a reverse index allowing fast search of documents with keywords. In the case of respidering, the merging tool merges the old reverse index with the spider journal that contains only differences between the previous and current state of documents on the internet.
The search daemon receives query terms and query parameters from a front end CGI via a UNIX socket, finds documents satisfying the query conditions, ranks them and sends the resulting top documents back to the frontend.
The frontend CGI displays the search form and search parameter controls. When a query is submitted, the frontend sends query terms and query parameters to the search daemon via a UNIX socket. After the search daemon completes its work, the frontend receives the search results. The HTML frontend then formats the results into an HTML page according to configuration templates. The XML frontend sends the results to the client in an XML format.
The storage tool is used to create, delete and clean indexes, compute index statistics and so on.
locust also includes a suite of auxiliary tools that monitor the search availability, display spidering results, report broken links and analyze search user behavior. We will describe the auxiliary tools in a separate chapter.
In order to avoid problems with file access permissions, we advise running all locust executables under a single account.
The command lcreatedb creates an empty index database accessible using the MySQL user account locust as follows:
lcreatedb indexname
The command will ask you for the MySQL root password.
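For example, assuming a hypothetical index named isn:

lcreatedb isn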
If for some reason a different account is required, the search administrator can edit lcreatedb, which is actually a shell script.
The MySQL user account (not to be confused with the UNIX user account under which locust is run) is created with the following MySQL statement, executed under the root account via, for example, the mysql interactive client:
grant usage on *.* to account@localhost identified by 'password';
where account is the desired account name and the password is the password to be set for this account.
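For example, to create the account locust (the password shown is only a placeholder and should be replaced with one of your own):

grant usage on *.* to locust@localhost identified by 'secret';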
The spider is started by the command
spider [-N number] indexname
where number is the number of simultaneously running download threads and indexname is the name of an index database. If -N is omitted, the spider runs with one downloader.
To cleanly stop the spider, the command
spider -E indexname
is used.
The spider reads the configuration files sets.cnf, spider.cnf and storage.cnf in the index configuration directory named indexname. The spider writes errors to the log file named indexname_spider.log.
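For example, to run the spider with eight download threads for a hypothetical index named isn, and later to stop it cleanly:

spider -N 8 isn
spider -E isn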
After the spider has finished its work, the merging tool is started by the command
deltamerge indexname
where indexname is the name of an index database.
The merging tool builds the files storing ordered reverse index of keywords, ordered forward and backward links and last document modification times. It also computes page ranking based on link popularity of documents. When the merging tool finishes merging a file, it erases the corresponding spider journal data.
If the merging tool fails, it can be run again, although there is a small probability that the data will be corrupted.
After the reverse index is produced, the database is ready for search and a search daemon can be started with the command
/usr/local/sbin/asearchd -RD indexname
where indexname is the name of an index database. The options R and D mean that the program should be daemonized and restarted if it crashes. Note that, for security reasons, the locust search daemon cannot be run under a privileged (root) account.
The search daemon can run during respidering and merging. It will still produce useful results, although there may be some minor inconsistencies between the index, the stored documents and the documents online.
The search daemon reads the configuration files searchd.cnf and storage.cnf in the index configuration directory named indexname. The search daemon writes errors to the log file named indexname_searchd.log.
The search daemon ranks documents according to the document "weight," a number between 0 and 1, where 1 corresponds to the most relevant documents. This number is computed from several partial weights (the weights of the title, the meta keywords, the meta description, the font size, the number of word occurrences and their distance from the document beginning, the degree of clustering of the query terms, and link popularity) according to their shares in the total weight specified in the searchd.cnf configuration file. The computation of the partial weights will be described in a separate document.
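As a rough sketch (the partial weight symbols w_title, w_kwd and so on are introduced here only for illustration), the total weight can be thought of as

weight = (TitleWeight * w_title + MetakwdWeight * w_kwd + MetadescrWeight * w_descr + FontsizeWeight * w_font + DistWeight * w_dist + ClusterWeight * w_cluster + RankWeight * w_rank) / 100

where each partial weight lies between 0 and 1 and the share parameters (TitleWeight, MetakwdWeight and so on) are the ranking weight shares set in searchd.cnf and described later in this guide.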
The locust search frontend is a CGI program that is executed by a web server. The frontend gets its input from the following sources: the CGI option string, cookies and the frontend configuration file.
Some of the frontend options can be specified in more than one source. In this case, CGI options have the first priority, followed by cookies and then the configuration file.
The HTML frontend uses the search results, input options and other necessary information to create an HTML page that displays the search results and contains links and HTML form elements allowing the user to continue the search and to navigate within the search results. The XML frontend outputs the same information, minus formatting and form elements, in the form of an XML document. If the search fails, the frontend outputs the error information.
Search results are split into pages, with each page containing the same number of results. Only one page at a time is displayed. The HTML frontend provides links to other result pages, grouped into the result page navigation bar that can be placed above and/or below the search results. The number of results per page and the number of result page links in the navigator are configurable.
Depending on input options, the frontend produces the following output types.
Any name suitable for a CGI can be selected for a frontend. A hard link with this name must be made to the frontend executable, named shtml.cgi for an HTML frontend or sxml.cgi for an XML frontend. The hard links are placed in a CGI directory of an HTTP server (see 3. Files and directories). There can be several differently configured frontends for one index.
The CGI option string is anything that follows the first ? character in the URL. When parameters are passed in the URL, they should be encoded according to RFC 1738, as described in the CGI Developer's Guide. If values come from HTML input fields, they are automatically encoded by the browser. In particular, spaces are changed to + or %20 and special characters are replaced with the %xx hexadecimal encoding. The options are separated with & signs and have the form <name>=<value>. In this document, values are shown in non-encoded form. The following is a description of the frontend options.
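For example, a query for the words nuclear security with the tl option switched on might be encoded in a URL as follows (the frontend name example.cgi is used here only for illustration):

http://isn-search.ethz.ch/cgi-bin/example.cgi?q=nuclear+security&tl=on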
Parameters bd, ds, kw and tl can be used together in any combination. The only valid value is on. Any other value is ignored.
The storage tool is started by the command
stortool [-S | -C | -P url] [-w] index
where index is the name of an index database. The option descriptions follow:
In order to integrate an index database geared for search with a (multi-)hierarchical directory (a tree of categories), the index works with a collection of sets of documents, each identified by a unique integer. These sets then serve as leaves of a (multi-)hierarchical tree of categories, described in the config file dir.cnf, which is used by the search daemon to restrict search results to a particular category.
The file sets.cnf defines the sets of documents to be indexed. The basic building block is a set of documents residing on one server, called a server set. Server sets residing on one or several servers are combined into a document set. Document sets can be further organized into a multi-hierarchical tree of categories.
The main unit of sets.cnf is a server set description. For example,
Server http://www.isn.ethz.ch {
    Folder /researchpub/publihouse/
    Allow "/pubs/ph/details\.cfm.*" NoIndex
    NoFolder /dossiers/terrorism/pubs/
    Disallow ".*output_print\.cfm.*"
    Start /php/collections/coll_overview.htm
    Allow /php/collections/
    Document /docu/update/2003/12-december/e1217a.htm
}
Here is a description of server set specification statements.
Note that the canonical server name should not end in '/'. The statements Disallow and Allow take a regular expression as their argument. Matching of regular expressions is case insensitive. Regular expressions containing spaces must be double-quoted; in this case, some characters contained in them must be escaped with a backslash according to the rules used for C strings. The statements Start, Folder and NoFolder take a fixed server path as their argument. Indexing starts from one or more explicit (Start) and/or implicit (Folder) starting paths. For example, to start indexing from the root path, the statement Start / must be present.
Normally, a document must belong to only one server set. The spider checks this and gives a warning message if a document belongs to more than one server set.
Docset "Research Institutions" 1024 { // International Security Network Server http://www.isn.ethz.ch // Center for Nonproliferation Studies, Monterey Institute of // International Studies Server http://cns.miis.edu // Center for Peace, Conversion and Foreign Policy of Ukraine Server http://www.cpcfpu.org.ua { Folder /eng/ } }
Note that server sets from one server may appear in several Docsets.
The Allow and Disallow statements inside each server set are processed in the order in which they appear in the config file. If the path matches an Allow statement, it is classified as "to be indexed" and processing stops. If it matches a Disallow statement first, it is classified as "not to be indexed" and processing stops. If there is at least one Allow, Folder or Document statement in a category, the statement "Disallow .*" is implicitly added at the end; otherwise, "Allow .*" is added.
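For example, in the following hypothetical server set (the server name and paths are illustrative), a path such as /pubs/details.cfm?print=1 matches the Disallow statement first and is excluded, other paths under /pubs/ match the Allow statement and are indexed, and, because an Allow statement is present, "Disallow .*" is implicitly appended, so everything else is excluded:

Server http://www.example.org {
    Folder /pubs/
    Disallow ".*print=1.*"
    Allow "/pubs/.*"
}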
The following shortcuts can be used. If a server set includes all the documents that can be found on the server, the block in the curly brackets can be omitted, for example:
Server http://www.nato.int
Indexing in this case will start from the root path “/”.
Uses of links external to site
There are two distinct reasons for following outside links: the first is new server discovery, and the second is indexing the documents referenced by external links.
Using links to another website has several important applications:
Config file parameters controlling external links
The following set of config parameters allows all four cases to be treated in one framework.
OutServers <regex> { <config_statements> }
If one or more OutServers statements are present, the spider will follow outside links to servers not included in the config file, provided their names match one of the regexes following an OutServers statement. The indexing configuration for an outside server is taken from the first matching OutServers block. If there are no OutServers statements, links to outside servers are not followed.
How exactly the outside links are followed is determined by the following parameter.
Examples
To index only the immediate outside links for the ISN site, the following configuration can be used.
FollowOutsideLinks yes
Server www.isn.ethz.ch {
    Start /
}
OutServers ".*" {
    Allow / NoFollow
}
To index only the first page of any .ch server found, the following configuration can be used.
FollowOutsideLinks no
Server www.isn.ethz.ch {
    Start /
}
OutServers ".*\.ch" {
    Start /
    Allow / NoFollow
}
The spider has the following configuration parameters.
MaxDocsPerServer - Limits the number of documents the spider downloads from a single server. If not set, there is no limit. The limit also includes failed download attempts.
MaxDocSize - Maximum size of a downloaded document in bytes. Larger documents are truncated. The default value is 1048576 bytes (1 MB).
IgnoreSuffix - Specifies URL suffixes to be ignored on technical grounds, for example executables (.exe). The suffixes are matched case-insensitively. The most common file types to be ignored are already specified in the sample configuration file. This parameter should not be used to exclude URLs on content grounds; the appropriate place to do that is the sets.cnf file.
IgnoreRegex - Same as above, except that regular expressions are used instead of suffixes.
Regular expressions containing spaces must be double-quoted; in this case, some characters in them must be escaped with a backslash according to the rules used for C strings. There may be multiple IgnoreSuffix and IgnoreRegex statements, each containing multiple suffixes or regular expressions. Items specified in these statements are added to the set of suffixes or regular expressions used for filtering URLs.
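For example (the particular suffixes and expression are illustrative only):

IgnoreSuffix .exe .zip .iso
IgnoreRegex ".*[?&]print=on.*"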
Converter - Specifies an application to convert document types other than text or HTML, for example .pdf files. The form is as follows:
Converter mimetypein mimetypeout "application $in $out"
where mimetypein is the original document MIME type, mimetypeout is text/html or text/plain, application is the path to the converter executable together with the required command line arguments, and $in and $out are literal placeholders for the actual temporary input and output files.
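For example, a hypothetical entry for PDF documents could look like the following (the converter path and arguments are illustrative; the exact command line depends on the converter installed on your system):

Converter application/pdf text/html "/usr/bin/pdftohtml $in $out"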
User - MySQL user account used to create the index database. Normally, user account locust is used.
Passwd - MySQL password for the user specified above.
Host - Host on which MySQL server is running. Normally, localhost is used.
Stordir - Storage directory, by default /var/locdata.
WordCacheSize - Size of the cache used to reduce the number of database queries when finding the numerical handle for a word during spidering. The optimal size of the cache depends on many factors, and we cannot give an exact formula to determine it. In general, if there is enough memory, try to increase the size and see if spider performance improves. Look also at the hit ratio that the spider outputs at the end of its run. The default size is 50000.
HrefCacheSize - Size of the cache used to reduce the number of database queries when finding the numerical handle for a link during spidering. The same considerations apply as above, only the number of links in a document is usually smaller than the number of words, so a smaller size than the one used for the word cache is adequate. The default size is 10000.
DBType - Database type, must always be set to mysql.
DBUser - MySQL user account used to create the index database. Normally, user locust is used.
DBPass - MySQL password for the user specified above.
DBHost - Host on which MySQL server is running. Normally, localhost is used.
DataDir - Storage directory, by default /var/locdata.
Port - Port number used for communication with the frontend.
AllowFrom - Limits access to the search daemon. The argument can be a hostname, an IP address or an address/mask (in CIDR form). In the latter case, access is allowed from the whole network. If there is no AllowFrom statement, access is unlimited.
MinFixedPatternLength - Minimum length of the fixed part of a word when a pattern is used in a query (like someth*). Words with a shorter fixed part will be rejected with an appropriate error message. The default value is 6.
Include ucharset.cnf - Must always be present.
Include stopwords.cnf - Stopwords for all or for selected language(s) can be included.
Ranking weight shares are specified in percentage points and their sum must be 100.
TitleWeight - The title weight share.
MetakwdWeight - The meta keyword weight share.
MetadescrWeight - The meta description weight share.
FontsizeWeight - The fontsize weight share.
DistWeight - The share of the weight that depends on the number of keyword occurrences and their distances from the beginning of the document.
ClusterWeight - Clustering weight share.
RankWeight - Link popularity weight share.
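For illustration, a hypothetical set of shares summing to 100 might look as follows in searchd.cnf (the actual values should be tuned for the particular document collection):

TitleWeight 20
MetakwdWeight 5
MetadescrWeight 5
FontsizeWeight 5
DistWeight 30
ClusterWeight 15
RankWeight 20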
A frontend configuration file name is constructed by replacing the .cgi suffix in the CGI executable name by the .cnf suffix. Configuration file is located in the directory /etc/locust/frontend/. If, for example, a CGI executable is named xscorm.cgi, it will use the configuration file xscorm.cnf.
The following options are defined in the configuration file.
Templates are named snippets of HTML code that are used to generate search result pages. Search result pages can be customised by editing templates.
Templates may contain placeholders for frontend CGI variables and for other templates to be nested in them. A placeholder for a CGI variable consists of the variable name preceded by the dollar sign '$'. A placeholder for a nested template consists of the template name preceded by the prefix "$T".
Templates are stored in a file (normally with the suffix .tmpl). This file must be specified in the corresponding frontend configuration file as the argument to the parameter "templateFile" using either the absolute path or the path relative to the config directory /etc/locust.
After the search parameters passed through CGI arguments are read and the search results are received from the search daemon, the frontend CGI substitutes variable values for the placeholders. Next, nested templates are inserted in place of the corresponding placeholders located in other templates. Template nesting can be several levels deep, and the whole process can get quite involved. Finally, the first-level templates are placed in the CGI output in the correct order.
HTML page generation is further complicated by the fact that, in certain cases, more complex manipulations of the template text are needed than simple value substitution. In particular, certain CGI arguments have to be repeated in the href values in templates in the form of a CGI argument "&<var-name>=<value>", but only if the <value> is non-empty. To set up the correct initial state of checkbox and option elements, boolean and integer variables respectively should be used to output the corresponding attributes "checked" or "selected", depending on the variable value.
To perform these manipulations in a clean and general way, we introduce the concept of a slaved variable. The following line describes the slaved variable syntax:
$<var-name>:<type>[(<arg>)]
A slaved variable value is computed from the master variable <var-name> value and the argument <arg> depending on the slaved variable type <type>.
Here is the description of the types of the slaved variables and the rules for generating their values.
"&<var-name>=<var-value>",empty otherwise.
"&q=money"if the value is not empty and by the empty string otherwise.
<input type="checkbox" name="tl" value="on" $tl:C(on)>if the master variable tl have the value "on", the slave variable $tl:c(on) will be replaced by "checked", otherwise it will be empty.
<option value="20" $ps:s(20)>if the master variable ps has the value "20", the slave variable $ps:s(on) will be replaced by "selected". Otherwise it will be empty.
For better separation between generic HTML code and customization code, the frontend recognizes include operators in the form
Include <filepath>
where <filepath> is an absolute path, or a path relative to the template file directory, pointing to the file to be included.
We recommend keeping customization HTML code, as far as practical, in separate include files. In particular, we recommend using the file head.incl for code to be included in the HTML head, top.incl for code to appear at the top of the page (both to be included inside the template top) and bottom.incl for code to appear at the bottom of the page (to be included inside the template bottom).
Using the Unix link mechanism, a single copy of the customization files can be used in several installations.
Below we describe the frontend variables and templates. Most of this information will not be needed by a web designer configuring search pages to suit the look and feel of his or her site. In most cases, it will be enough to create the above-mentioned include files to customize the top and bottom templates. Some small changes in style made in other templates may also be desired. Thus, customization work does not require a complete understanding of the variables and the intricacies of HTML page generation. It is enough to understand the basic principles and to use the rest of this section as a reference when required.
The following variables are recognized by the locust frontend. The variables describing search parameters have default values (often empty) and can be assigned non-default values via the front-end configuration files, cookies and CGI options. These variables are usually named after corresponding CGI options. The variables describing search results are assigned values received by the frontend from the search daemon.
Constants
Miscellaneous
Search forms
Subset search
Document field search restrictions
Result presentation options
Properties of result document set
Individual document properties
Search statistics
Variables depending on the current query
Variables used in the result page navigation bar to form links to other search result pages
Information about document's site for the "moreurls" template
Advanced search
A template in the template file has the following form
<template-type> <template-name> { <template-body> }
where <template-type> is one of the two keywords Template or iTemplate, <template-name> is the name of the template, and <template-body> can be any snippet of HTML code with frontend variable and nested template placeholders inserted. The only difference between the two template types is that iTemplate (from "inner template") is a short one-line template whose body is substituted into other templates without the terminating newline character. This is done to preserve the readability of the HTML code produced by the frontend.
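For illustration only (the template names, variable names and HTML shown here are hypothetical and not necessarily those used by the distributed template file), a one-line inner template and a template nesting it might look like:

iTemplate hlopen { <b> }

Template extitle {
<a href="$url">$Thlopen$title</a>
}

Here $url and $title would be frontend variable placeholders, and $Thlopen is a placeholder for the nested template hlopen.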
To accommodate cases where the HTML code is conditional, for example the "Cached" link for an HTML document versus the "Text version" link for a PDF document, dynamic templates are used. They take the value of one of several templates, depending either on a condition or on whether a template value is empty. The dependence of dynamic templates on conditions and basic templates is hard-wired into the frontend, rather than being expressed in the template file. In the case of the result page navigation bar, the dynamic template "dnavbars" is generated using more complex hard-wired rules than a simple conditional substitution.
Top level templates
Single result templates repeated according to number of results
Navigator between the result pages, nested in restop and resbot
Templates nested in res0
Highlighting query words in title and excerpts
Opening and closing of excerpts
Cached document mode
Highlighting query words in a cached document (with colors)
Error and warning messages
Here are the files and directories used by locust.
/usr/local/sbin - Executables are placed here by the installation procedure.
/etc/locust - Directory for locust configuration files. Configuration directories named after specific indexes are placed here. It also contains the charset configuration, which includes the file ucharset.cnf and the directory chartabs with charset definitions, and the stopwords configuration, which includes the file stopwords.cnf and the directory stopwords with stopword definitions.
/etc/locust/frontend - Directory for the frontend configuration files; it also contains the file port_assignments, in which the ports used for communication between frontends and search daemons must be recorded.
/var/locdata - This is the location of the index databases, can be changed in the storage.cnf file.
/var/locdata/mysql - MySQL data directory, can be changed in the storage.cnf file.
/var/www/cgi-bin - Frontend CGI's shtml.cgi and sxml.cgi are placed here by the installation procedure.
/var/locust/log - Log file directory. Contains the global log file locust.log and index-specific log files with names formed as indexname_spider.log for spiders and indexname_searchd.log for search daemons, where indexname is an index database name.
/var/locust/run - Run-time file directory for status information; for example, the locwatch monitoring tool keeps information on alerts here, so that repeated alerts can be suppressed.
/etc/locust/auxtools - Auxiliary tools configuration directory.
/etc/init.d/locust - A start up script that restarts search daemons on boot-up.
Error detection, error handling and error reporting constitute a surprisingly large and non-trivial part of any serious program. The locust executables detect and report over one hundred different errors. Errors can be classified by their severity, subject, intended audience and the way an error message is delivered to its recipient. When talking about errors in this paragraph and in the rest of this chapter, we usually mean "errors, warnings and information messages."
We will not (at least not yet) describe every error here, but rather outline error treatment in locust. Most individual error messages are self-explanatory.
Currently, the error system is not completely finished. What is missing are the configuration options that will allow a search administrator to control error logging and output.
Error messages have the following general form.
<errcode> <level> [<time>] <message> <details>
where <errcode> is the numerical error code, <level> is the severity level, <time> is the current local time, <message> is the verbal content of the message, and <details> is specific information, such as a URL or a file name. In some cases, for stylistic reasons, <message> and <details> may appear in a different order. Some messages have not yet been assigned an error code, in which case the code is omitted.
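For example, a message reporting a failed host lookup might look like this (the code, level name, time format and wording are illustrative, not taken from the actual error list):

2031 ERROR [Tue Mar 16 04:12:55 2004] Cannot resolve host address www.example.org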
locust log files are located in the directory /var/locust/log. Messages concerning the starting and stopping a spider or a search daemon are written into the global log file locust.log. Otherwise messages are logged into index specific log files with the names formed as indexname_spider.log for spiders and indexname_searchd.log for search daemons, where indexname is an index database name. If an executable cannot open an index specific log file, it logs an error message in the global log file and exits.
The spider and other executables, except the search daemon, print messages to the standard output. Note: in the future, logging will become configurable, and it will be possible to block printing to the standard output completely or partially.
The monitoring tool locwatch uses email to send error messages to the email addresses specified in its configuration file. SMS messages can be sent via an email-to-SMS gateway.
If the spider is run via cron, cron can be configured to send the spider's standard output to the email addresses of interested users.
HTML and XML frontends send errors to the browser for the search service user to see.
locust messages have four different target audiences.
locust messages are divided by subject into the following groups.
Errors in the configuration files. Intended audience is search service administrators.
Lack of disk space, memory, failure opening files. Intended audience is search service administrators.
Errors resolving host IP address.
Errors when attempting to download documents. These errors are also named "extended HTTP" errors because their codes are used in the document status as an extension of HTTP status.
HTTP protocol status codes, except the 200 OK code.
Spidered site errors include robots.txt errors, missing and misconfigured links, charset specification errors, some document errors (invalid UTF-8 characters, tags not closed).
Errors occurring when storing data into the index database and reading from it.
Program inconsistency messages are addressed to locust developers.
locust includes a suite of auxiliary tools that monitor search availability, display spidering results, report broken links, and analyze user behavior. Most tools are implemented as CGIs and started via a browser based interface.
The monitoring tool locwatch is periodically executed by cron. It checks the search services specified in its configuration file and, if a service has failed, sends an alarm by email or SMS. The following crontab entry checks the search services specified in the configuration file locwatch10 every 10 minutes.
0-50/10 * * * * /usr/local/sbin/locwatch -f locwatch10
The configuration file can contain the following configuration parameters.
FailRecipients - Specify email addresses of fail message recipients.
RestoreRecipients - Specify email addresses of restore message recipients.
Server - Specifies the server name. In the following block, the server IP can be specified (to avoid frequent DNS requests), along with the search paths on the server to check and the minimum number of documents each must return.
IP - Server IP inside a server block.
Path - A path inside a server block. This parameter has two arguments. The first is a path on the server, including the search CGI and a query. The second is the minimum number of results this search must return to be considered working properly.
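The following is a minimal illustrative sketch of such a configuration (the addresses, server name, IP, query and threshold are hypothetical):

FailRecipients admin@example.org
RestoreRecipients admin@example.org
Server isn-search.ethz.ch {
    IP 192.0.2.10
    Path /cgi-bin/example.cgi?q=security 10
}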
An example of a configuration file can be found in the distribution under the path files/etc/auxtools/locwatch10
The CGI tool sitechk reports broken and misconfigured links in a particular index database. When the tool is opened, a list of index databases is presented in a drop-down menu. Select the desired database. To include redirected documents in the report, select the "Include redirected" checkbox. Then click the "Submit" button, and a list of pages containing broken or misconfigured links will appear in the browser.
The CGI tool servrep reports the number of documents in a particular index database per site and per download status. It helps to find closed, moved or non-working sites.
sestat displays a list of the most frequent search queries and search terms, compiled for a given period of time (normally one month). The tool also compiles lists of keywords that gained and lost the most popularity compared to the previous period.
logsplit helps an administrator analyze spider logs, which can be very large. It splits the log file into separate files for each error type.
When many indexes are run on one server, the organization of the search services becomes important.
In Linux, the cron scheduling facility can be used to periodically respider indexes.
For an index to be periodically respidered, create an entry in the crontab table associated with the account used to run locust. The locust installation script installs an initial crontab table with a commented-out sample entry.
To edit the table, use the command crontab -e. Inside the crontab editor, copy the sample entry, uncomment it and replace the index name with the one you desire. Set the desired respidering times using the crontab syntax.
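For example, the following hypothetical entry respiders and remerges an index named isn every Saturday night at 01:30 (chaining the spider and the merging tool with && is one possible arrangement; the sample entry installed with the distribution may differ):

30 1 * * 6 /usr/local/sbin/spider -N 8 isn && /usr/local/sbin/deltamerge isn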
The locust installation script installs an initialization script locust with a commented-out sample entry into the Linux initialization and termination scripts directory /etc/init.d. To ensure that the search services are automatically restored upon rebooting, the script must be activated with the command
chkconfig --add locust
For each instance of a search daemon, add an entry to the script /etc/init.d/locust. To do this, just copy the provided sample entry, uncomment it and replace the index name with the one you desire.
The search daemons can be manually started with the command
/etc/init.d/locust start
and stopped with the command
/etc/init.d/locust stop
A search form can be installed on any web page by inserting the following HTML code.
<form method="GET" action="http://isn-search.ethz.ch/cgi-bin/example.cgi"> <table> <tr> <td> <input type="text" name=q size=35 value=""> </td> <td> <input type="submit" value="Search"> </td> </tr> </table>
where http://isn-search.ethz.ch/cgi-bin/example.cgi is a locust frontend URL.
To pass additional CGI options in the URL of the linked search page, hidden input tags can be inserted into the form. For example, the tag
<input type="hidden" name="gr" value="on">
puts the option gr=on in the URL of the search page. This option causes the search results on the linked page to be grouped by site.